Let's begin fresh:
rm(list=ls()) # clear the environment
graphics.off() # close all open plots
cat("\014") # clear the console
Now, let's load the pacman package, which loads packages conditionally: p_load(PACKAGENAME) installs a package only if it is not already installed, and then loads it. Let's install pacman itself first.
if (!require("pacman")) install.packages("pacman"); library(pacman)
Let's load the dataset:
p_load(foreign)
dat <- read.csv("data.csv",
header = T, # the first row contains the column names
sep = ";") # fields are separated by ";"
dat.deliverable = dat # an untouched copy, so we can later report the row_nums of all cases where hits is NA. More on this at the very end.
Let's summarize and inspect the data:
head(dat)
## row_num locale day_of_week hour_of_day agent_id entry_page path_id_set
## 1 988681 L6 Monday 17 1 2111 31672;0
## 2 988680 L2 Thursday 22 10 2113 31965;0
## 3 988679 L4 Saturday 21 2 2100 0;78464
## 4 988678 L3 Saturday 19 8 2113 51462
## 5 988677 L2 Tuesday 6 10 2116 31931;0
## 6 988676 L3 Monday 1 8 2100 0
## traffic_type session_durantion hits
## 1 6 7037 \\N
## 2 2 49 14
## 3 1 1892 14
## 4 6 0 1
## 5 1 2 3
## 6 1 0 2
summary(dat)
## row_num locale day_of_week hour_of_day
## Min. : 1 L1: 41568 Friday :127740 Min. : 0.0
## 1st Qu.:247171 L2:172904 Monday :158685 1st Qu.: 8.0
## Median :494341 L3:352454 Saturday :119350 Median :14.0
## Mean :494341 L4:131271 Sunday :144330 Mean :13.2
## 3rd Qu.:741511 L5:112054 Thursday :136476 3rd Qu.:19.0
## Max. :988681 L6:178430 Tuesday :155001 Max. :23.0
## Wednesday:147099
## agent_id entry_page path_id_set traffic_type
## Min. : 0.000 Min. :2100 0 : 54444 Min. : 1.000
## 1st Qu.: 6.000 1st Qu.:2111 38715;0: 11002 1st Qu.: 1.000
## Median : 9.000 Median :2113 34741;0: 8958 Median : 2.000
## Mean : 7.351 Mean :2253 34812;0: 7408 Mean : 2.774
## 3rd Qu.:10.000 3rd Qu.:2116 78506;0: 6242 3rd Qu.: 4.000
## Max. :15.000 Max. :8101 34227;0: 5614 Max. :10.000
## (Other):895013
## session_durantion hits
## 0 :115579 \\N :369446
## 2 : 43906 3 : 98274
## 3 : 40324 4 : 65158
## 4 : 33648 2 : 50675
## 5 : 26183 5 : 37135
## 1 : 21331 1 : 29381
## (Other):707710 (Other):338612
We could use View(dat) to take a closer look, but the dataset is very large, so that wouldn't be very convenient.
colnames(dat)[9] # note the typo ("session_durantion").
## [1] "session_durantion"
colnames(dat)[9] <- "session_duration" # now it's fixed.
dat$hits[dat$hits == "\\N"] <- NA # "\\N" encodes missing values
Now, let's drop all rows that contain NAs; we certainly don't want NAs in the dependent variable.
dat = na.omit(dat) # deletes rows with NAs.
is(dat$hits)[1] # let's ask R what "hits" is.
## [1] "factor"
dat$hits = as.numeric(as.character(dat$hits)) # treat as numeric for the rest of the assignment; as.character first, since as.numeric on a factor returns level codes, not values.
is(dat$session_duration)[1] # let's ask R what "session_duration" is.
## [1] "factor"
dat$session_duration = as.numeric(as.character(dat$session_duration)) # better as numeric; as.character first so we get the values, not the factor's level codes.
is(dat$agent_id)[1] # let's ask R what "agent_id" is.
## [1] "integer"
dat$agent_id = as.factor(dat$agent_id) # better as factor.
levels(dat$day_of_week) # note: the levels are alphabetical, not in weekday order.
## [1] "Friday" "Monday" "Saturday" "Sunday" "Thursday" "Tuesday"
## [7] "Wednesday"
dat$day_of_week = factor(dat$day_of_week,levels(dat$day_of_week)[c(2,6,7,5,1,3,4)]) # reorder Monday through Sunday
is(dat$entry_page)[1] # let's check this variable's type.
## [1] "integer"
dat$entry_page = as.factor(dat$entry_page) # let's change it to a factor.
Now, it seems that a large part of this distribution is concentrated in landing pages 2100, 2111, 2113, 2114, 2115 and 2116.
table(dat$entry_page)[1:6]
##
## 2100 2111 2113 2114 2115 2116
## 110762 51932 205968 44589 10655 125338
# loading lattice, for a "nicer" histogram
p_load(lattice)
histogram(dat$entry_page, xlab = "Entry Page", breaks = length(levels(dat$entry_page)))
To make a more parsimonious model with fewer intercepts, I will recode this variable as follows: keep "landing pages" 1 to 6, and create a seventh landing page coded as "0". This residual category groups all other entry pages, many of which had only a handful of sessions. Let's recode the variable.
levels(dat$entry_page)[levels(dat$entry_page) > "2116"] <- "0" # note: compares level names as strings
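A caveat on this recode: `levels(...) > "2116"` compares level names as character strings (lexicographically). That is safe here only because every entry_page level is a four-digit code:

```r
# Lexicographic comparison of level names; all entry_page codes are 4 digits
"2117" > "2116"  # TRUE
"8101" > "2116"  # TRUE
"10000" > "2116" # FALSE -- string comparison would misfire with mixed-width codes
```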
Now, let's plot the distribution of the new variable's levels:
p_load(lattice)
histogram(dat$entry_page)
OK. This looks much better. Instead of almost 150 landing pages, we have simplified this variable down to the most important ones.
head(dat$path_id_set)
## [1] 31965;0 0;78464 51462 31931;0 0 0
## 151268 Levels: 0 ... 99998;0
It makes more sense to think of this variable as a count of "locations that were visited during the session" rather than as the locations themselves, because it is hard to model hits as a function of highly specific, almost unique location sets.
Let's count the number of location IDs in each (semicolon-separated) string of this, as-is, character variable.
p_load(stringr)
dat$path_id_set = str_count(as.character(dat$path_id_set), '\\d+')
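As a quick sanity check on the counting logic, str_count with the pattern "\\d+" counts each run of digits as one location (using example values seen in head(dat$path_id_set) above):

```r
library(stringr)
# Each run of digits is one visited location; "0" counts as a location too.
str_count("31965;0", "\\d+") # 2 locations
str_count("51462", "\\d+")   # 1 location
str_count("0", "\\d+")       # 1 location
```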
Now the variable is a count variable.
Note: I assumed that "location 0" (very frequent, by the way) is an actual location among the "locations that were visited during the session".
That said, "path_id_set" is now almost all 2's. That is, almost every session visited two locations.
p_load(lattice)
histogram(dat$path_id_set, breaks = 100)
To make a simpler model, I will dichotomize "path_id_set": observations become either 2 (coded 1) or not (coded 0). Either session "i" visited exactly two locations, or it didn't.
dat$path_id_set = as.numeric(dat$path_id_set == 2)
p_load(lattice)
histogram(as.factor(dat$path_id_set))
is(dat$traffic_type)[1] # currently an integer.
## [1] "integer"
dat$traffic_type = as.factor(dat$traffic_type) # Better as factor.
dat$day_of_week.num = as.numeric(as.factor(dat$day_of_week))
# test
data.frame(
old = dat$day_of_week,
new = dat$day_of_week.num
)[100:110,] # from row 100 to row 110. It looks OK.
## old new
## 100 Monday 1
## 101 Monday 1
## 102 Thursday 4
## 103 Saturday 6
## 104 Wednesday 3
## 105 Wednesday 3
## 106 Wednesday 3
## 107 Thursday 4
## 108 Saturday 6
## 109 Friday 5
## 110 Monday 1
p_load(lattice)
histogram(dat$hits) # plot the DV
First impressions: of course, it doesn't look like a normal distribution, and it can't be one anyway. The data-generating process (a "number of hits") more plausibly follows a Poisson distribution, i.e. discrete non-negative counts. But let's take a closer look at the dependent variable. Normal distributions have the same mean, median, and mode. Let's see:
mean(dat$hits) # Mean
## [1] 313.0943
median(dat$hits) # Median
## [1] 286
getmode <- function(v) { # Little function to get the mode.
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(dat$hits) # Mode
## [1] 286
OK. Visual inspection suggests that the DV is not normally distributed, although the summary statistics above show that the mean, median, and mode are not far apart. In any case, at the modeling stage I will still consider Poisson and Negative Binomial models (and their corresponding diagnostics) for completeness.
A few more modifications before we start modeling: underscores in variable names interfere with both TeX and HTML knitr compilation, so I will replace "_" with ".".
colnames(dat) <- gsub("_", ".", colnames(dat)) # e.g. "row_num" becomes "row.num", "day_of_week" becomes "day.of.week", etc.
save(dat, file = "dat.RData")
In this section, I will fit the data using several models and strategies. My general approach will be the following:
Let's define the full equation:
formula.all = as.formula(
hits ~
locale +
hour.of.day +
agent.id +
entry.page +
path.id.set +
traffic.type +
session.duration +
day.of.week.num
)
Let's fit the data via a linear model:
all.vars.model.ols = lm(formula.all, dat)
Now, before discussing the output of the unrestricted model, let's first find the most efficient subset of variables. The criterion will be minimizing the RMSE. Next, I use the ols_step_best_subset function. Stepwise regression usually selects variables by lowest p-value, so I decided to take a slightly different approach and focus on the RMSE instead.
Note: (a) this might take a while, depending on your machine; and (b) the output is very wide and might not fit the PDF. That's why I'm also sending an HTML file, which you can open in almost any web browser. Most importantly, the column we care about (MSEP: estimated error of prediction) might not appear on the PDF.
p_load(olsrr) # for "ols_step_best_subset" function.
ols_step_best_subset(all.vars.model.ols)
## Best Subsets Regression
## ---------------------------------------------------------------------------------------------------------------
## Model Index Predictors
## ---------------------------------------------------------------------------------------------------------------
## 1 session.duration
## 2 entry.page session.duration
## 3 entry.page path.id.set session.duration
## 4 entry.page path.id.set traffic.type session.duration
## 5 agent.id entry.page path.id.set traffic.type session.duration
## 6 locale agent.id entry.page path.id.set traffic.type session.duration
## 7 locale hour.of.day agent.id entry.page path.id.set traffic.type session.duration
## 8 locale hour.of.day agent.id entry.page path.id.set traffic.type session.duration day.of.week.num
## ---------------------------------------------------------------------------------------------------------------
##
## Subsets Regression Summary
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.0294 0.0294 0.0294 20737.3409 8395513.2046 6638200.2677 8395547.2134 45251.8443 45251.8443 0.0731 0.9706
## 2 0.0493 0.0493 0.0493 7599.9994 8382679.2163 6625356.3381 8382781.2425 44323.2691 44323.2691 0.0716 0.9507
## 3 0.0571 0.0571 0.057 2507.2953 8377627.6694 6620304.8320 8377741.0318 43963.1630 43963.1630 0.0710 0.9429
## 4 0.0586 0.0586 0.0586 1477.5817 8376611.2496 6619278.4207 8376792.6294 43890.7061 43890.7061 0.0709 0.9414
## 5 0.0601 0.0601 0.06 505.5815 8375666.7891 6618307.9743 8376006.8763 43822.8948 43822.8948 0.0708 0.9399
## 6 0.0609 0.0609 0.0608 -12.6314 8375156.7723 6617789.9677 8375553.5407 43786.5333 43786.5333 0.0707 0.9391
## 7 0.0609 0.0609 0.0608 -18.0892 8375151.3141 6617784.5098 8375559.4188 43786.1473 43786.1473 0.0707 0.9391
## 8 0.0609 0.0609 0.0608 -18.0000 8375151.4032 6617784.5990 8375570.8441 43786.1536 43786.1536 0.0707 0.9391
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
The MSEP carries almost the same information as the RMSE: taking the square root of the MSEP yields (approximately) the RMSE. Hence I'll use the MSEP to select the most efficient variables. Models 7 and 8 have, statistically speaking, identical prediction errors. Since I'm just scratching the surface at this point, I will select model 8, which has all possible variables, and begin modeling/cleaning from there.
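As a quick check, taking the square root of model 8's MSEP from the table above approximately recovers the full-model RMSE computed next:

```r
sqrt(43786.1536) # square root of model 8's MSEP; about 209.25
```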
But let's now calculate the actual RMSE of the full model.
sqrt(rev(anova(all.vars.model.ols)$"Mean Sq")[1]) # RMSE
## [1] 209.2499
OK. Now let's see how we are doing goodness-of-fit-wise.
p_load(lattice)
xyplot(predict.lm(all.vars.model.ols, type="response") ~ dat$hits,
xlab = "Observed Hits",
ylab = "Predicted Hits",
main = "Model 1: Observed v. Fitted")
Not so good. There are many high-influence points (upper-left corner of the plot) and leverage problems pretty much across the whole x-axis. My take: not a good fit (i.e. not a good enough RMSE).
Let's now plot the residuals to check normality and homoscedasticity, that is, constant variance of the residuals. Given the poor fit, we can't anticipate promising results at this stage.
p_load(lattice)
xyplot(all.vars.model.ols$residuals ~ dat$hits,
xlab = "Observed Hits",
ylab = "Residuals",
main = "Model 1: Residual Plot")
Huge problems here. We should see no patterns, but we clearly do: the residuals grow as the X variable grows. That's not a good sign.
One problem with this dataset is that it may contain extreme observations (outliers). In this section, I will try to detect those and evaluate what we can do with them.
Let's compute Cook's distance (a standard measure of influential data points) and merge it into our main dataset.
options(scipen=100000000) # prefer fixed over scientific notation (the values here are very small).
dat$cooks.all.ols <- cooks.distance(all.vars.model.ols)
Let's also summarize the Cook's distances:
summary(dat$cooks.all.ols)
## Min. 1st Qu. Median Mean 3rd Qu.
## 0.0000000000 0.0000000884 0.0000005079 0.0000014147 0.0000016880
## Max.
## 0.0018523941
And let's plot the Cook's D as well.
p_load(lattice)
xyplot(dat$cooks.all.ols ~ predict(all.vars.model.ols, type="response"),
grid = TRUE,
xlab = "Predicted Hits",
ylab = "Cooks D",
main = "Model 1: Influential Observations (Cooks D)",
panel = function(x, ...) {
panel.xyplot(x, ..., alpha = 0.3)
}
)
My recommendation is to drop all observations that fall above the third quartile. That would mean keeping 464426 observations. In a big-data setting, that's decent enough.
Let's subset the data then.
dat.cook = subset(dat, cooks.all.ols < as.numeric(summary(dat$cooks.all.ols)[5]))
Now we have a smaller data frame, but with fewer influence problems. Let's fit Model 2: all variables, but using the cleaner dataset (i.e. dat.cook).
all.vars.model.ols.cook = lm(formula.all, dat.cook)
OK. Now, let's inspect Cook's distance again.
options(scipen=100000000) # prefer fixed over scientific notation (the values here are very small).
dcooks.distance.ols.fixed <- cooks.distance(all.vars.model.ols.cook) # getting Cook's distance.
# plot
p_load(lattice)
xyplot(dcooks.distance.ols.fixed ~ predict(all.vars.model.ols.cook, type="response"),
grid = TRUE,
xlab = "Predicted Hits",
ylab = "Cooks D",
main = "Model 2: Influential Observations (Influential Obs. Dropped)",
panel = function(x, ...) {
panel.xyplot(x, ..., alpha = 0.3)
}
)
Now the problem is tolerable, though some issues remain. More on that below. Let's now inspect the RMSE of Model 2.
sqrt(rev(anova(all.vars.model.ols.cook)$"Mean Sq")[1]) # RMSE
## [1] 133.3775
Not surprisingly, this RMSE is much better than Model 1's. Let's see now how we are doing goodness-of-fit-wise.
p_load(lattice)
xyplot(predict(all.vars.model.ols.cook, type="response") ~ dat.cook$hit,
grid = TRUE,
xlab = "Observed Hits",
ylab = "Predicted Hits",
main = "Model 2: Full Model Without Most Influential Observations",
panel = function(x,
y, ...) {
panel.xyplot(x, y, ...)
panel.lmline(x, y, ...)
}
)
OK. It does look better now. However, we can now see that we also have leverage issues.
Next, I will concentrate on the DFFITS.
p_load(olsrr)
ols_plot_dffits(all.vars.model.ols.cook)
Lots of problems here. Let's drop all observations that land below/above the threshold. First, let's calculate the DFFITS so we can merge them into our dataset.
dat.cook$dffits.ols = dffits(all.vars.model.ols.cook)
Second, let's plot the distribution of DFFITS, so we can actually see what we are about to drop.
densityplot(dat.cook$dffits.ols,
xlab = "Dffits",
main = "Density of DFFITS from Fitting Model 2\n(Most Influential Observations Already Dropped)",
panel = function(...) {
panel.densityplot(...)
panel.abline(v = c(0.02, -0.02))
}
)
The main takeaway is the following: if we chop off the tails of the distribution (i.e. the high-leverage points), we still retain most of the information. Let's drop these high-leverage data points (306 observations). It's not a lot, actually, but let's see whether it improves fit, decreases error, and yields a better RMSE.
First, let's drop the data points with high/low leverage:
dat.cook.dffits = subset(dat.cook, dffits.ols > - 0.02 & dffits.ols < 0.02)
Next, let's fit Model 3, now without extreme DFFITS (and, of course, without high Cook's distance values).
all.vars.model.ols.cook.dffits = lm(formula.all, dat.cook.dffits)
Now, let's get those DFFITS, so we can plot them.
options(scipen=100000000) # prefer fixed over scientific notation (the values here are very small).
dffits.dcooks.distance.ols.fixed <- dffits(all.vars.model.ols.cook.dffits) # getting DFFITs.
# plot
p_load(lattice)
xyplot(dffits.dcooks.distance.ols.fixed ~ predict(all.vars.model.ols.cook.dffits, type="response"),
grid = TRUE,
xlab = "Predicted Hits",
ylab = "DFFITS",
main = "Full Model with Influential Obs. Dropped\n(Low CooksD, and no extreme DFFITs)",
panel = function(x, ...) {
panel.xyplot(x, ..., alpha = 0.3)
}
)
Now the problem is tolerable. Let's now get those RMSEs.
sqrt(rev(anova(all.vars.model.ols.cook.dffits)$"Mean Sq")[1]) # RMSE
## [1] 133.399
Not much better. It's almost the same as Model 2, at the price of losing a couple hundred observations. In fact, the RMSE is slightly higher. Let's inspect fit.
p_load(lattice)
xyplot(predict(all.vars.model.ols.cook.dffits, type="response") ~ dat.cook.dffits$hit,
grid = TRUE,
xlab = "Observed Hits",
ylab = "Predicted Hits",
main = "Model 3: Full Model Without Most Influential Observations",
panel = function(x,
y, ...) {
panel.xyplot(x, y, ...)
panel.lmline(x, y, ...)
}
)
OK, so far we've fitted three models (full, good Cook's, and good Cook's plus DFFITS). Let's use this truncated data frame (i.e. dat.cook.dffits) to take the best-subset regression route again.
p_load(olsrr)
ols_step_best_subset(all.vars.model.ols.cook.dffits)
## Best Subsets Regression
## ---------------------------------------------------------------------------------------------------------------
## Model Index Predictors
## ---------------------------------------------------------------------------------------------------------------
## 1 entry.page
## 2 entry.page session.duration
## 3 entry.page path.id.set session.duration
## 4 entry.page path.id.set traffic.type session.duration
## 5 agent.id entry.page path.id.set traffic.type session.duration
## 6 locale agent.id entry.page path.id.set traffic.type session.duration
## 7 locale hour.of.day agent.id entry.page path.id.set traffic.type session.duration
## 8 locale hour.of.day agent.id entry.page path.id.set traffic.type session.duration day.of.week.num
## ---------------------------------------------------------------------------------------------------------------
##
## Subsets Regression Summary
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.0990 0.0990 0.099 58491.2203 5914415.0189 4597289.0421 5914503.4021 20038.2107 20038.2107 0.0432 0.9010
## 2 0.1757 0.1757 0.1757 14007.7060 5873126.6508 4556000.9693 5873226.0819 18332.5917 18332.5917 0.0395 0.8243
## 3 0.1911 0.1910 0.191 5118.2793 5864416.3449 4547290.7535 5864526.8239 17991.7458 17991.7458 0.0388 0.8090
## 4 0.1945 0.1945 0.1945 3109.0250 5862434.7191 4545299.1487 5862611.4855 17914.8982 17914.8982 0.0386 0.8055
## 5 0.1972 0.1972 0.1972 1536.1104 5860895.6080 4543734.0647 5861227.0450 17855.0873 17855.0873 0.0385 0.8028
## 6 0.1999 0.1998 0.1998 -2.4031 5859367.5913 4542198.0878 5859754.2678 17796.2465 17796.2465 0.0383 0.8001
## 7 0.1999 0.1999 0.1998 -18.0936 5859351.8998 4542182.3969 5859749.6242 17795.6448 17795.6448 0.0383 0.8001
## 8 0.1999 0.1999 0.1998 -18.0000 5859351.9933 4542182.4906 5859760.7655 17795.6484 17795.6484 0.0383 0.8001
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
If we want to maximize parsimony and minimize the RMSE, model 6 now does the best job (with two fewer predictors). Let's define a new formula, then, with the best predictors.
formula.best = as.formula(hits ~ locale + agent.id + entry.page + path.id.set + traffic.type + session.duration)
Let's use this formula to fit Model 4, using the dat.cook.dffits dataset.
best.vars.model.ols = lm(formula.best, dat.cook.dffits)
Let's inspect goodness of fit.
p_load(lattice)
xyplot(predict(best.vars.model.ols, type="response") ~ dat.cook.dffits$hit,
grid = TRUE,
xlab = "Observed Hits",
ylab = "Predicted Hits",
main = "Model 4: Best Model Without Most Influential Observations",
panel = function(x,
y, ...) {
panel.xyplot(x, y, ...)
panel.lmline(x, y, ...)
}
)
Checking influence and leverage again.
ols_plot_cooksd_bar(best.vars.model.ols) # Cooks D
ols_plot_dffits(best.vars.model.ols) # DFFITs
Cook's D looks good. DFFITS look just OK, but good enough: most high-leverage observations land just below/above the threshold, although around 10 (perhaps slightly fewer) land far below the negative threshold.
Let's now get the RMSE.
sqrt(rev(anova(best.vars.model.ols)$"Mean Sq")[1]) # RMSE
## [1] 133.4016
Not much better. It's almost the same as Model 3, after losing a couple hundred observations. In fact, the RMSE is slightly higher. Let's try a different strategy.
Now, let's turn to robust regression. These models weight each data point's contribution by the size of its residual: the farther away a data point is, the less "importance" it gets. This strategy works well in cases like this one, where we have leverage/influence issues.
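To get some intuition for that weighting, MASS's psi.bisquare with deriv = 0 returns the weight assigned to a standardized residual: 1 at zero, falling to 0 beyond the tuning constant (c = 4.685 by default). A small illustrative sketch:

```r
library(MASS)
# Weights assigned to increasingly extreme standardized residuals:
psi.bisquare(c(0, 1, 3, 5), deriv = 0)
# residual 0 gets full weight (1); residuals beyond c = 4.685 get weight 0
```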
dat$agent.id = as.numeric(as.character(dat$agent.id)) # convert back to numeric; the factor version interferes with LaTeX compilation.
p_load(MASS)
rlm.best.vars.entire.df = rlm(
hits ~ locale + agent.id + entry.page + path.id.set + traffic.type + session.duration,
data = dat,
psi = psi.bisquare
)
Let's evaluate the RMSE.
sqrt(rev(anova(rlm.best.vars.entire.df)$"Mean Sq")[1]) # RMSE
## [1] NA
Not very informative: the anova-based RMSE shortcut returns NA for rlm objects. Let's consider Model 6, which uses the same robust approach but on the truncated dataset.
dat.cook.dffits$agent.id = as.numeric(as.character(dat.cook.dffits$agent.id))
p_load(MASS)
rlm.best.vars.truncated.df = rlm(
hits ~ locale + agent.id + entry.page + path.id.set + traffic.type + session.duration,
data = dat.cook.dffits,
psi = psi.bisquare
)
Let's now calculate the RMSE via DMwR's regr.eval, which works directly on observed and fitted values (this recovers Model 5's RMSE, on the full dataset):
p_load(DMwR)
as.numeric(regr.eval(dat$hits, rlm.best.vars.entire.df$fitted.values)[3]) # RMSE
## [1] 209.9251
OK; this is better, but not the best (compared to Model 2).
Let's now consider GLMs. We will focus on two kinds: Poisson and Negative Binomial.
Does hits look like a (theoretical) Poisson distribution?
p_load(vcd)
distplot(dat$hits, type="poisson")
When the empirical distribution resembles the theoretical one, we should see a straight line, as we do here. So, to answer my own question: yes, it definitely looks like a Poisson distribution.
Let's check whether hits looks like a Negative Binomial distribution.
p_load(vcd)
distplot(dat$hits, type="nbinom")
In this case, it definitely does not look like a Negative Binomial distribution. Let's fit both models anyway.
First, we will do a stepwise regression with a Poisson model. Let's run the full model first.
full.model.poisson = glm(hits ~ locale +
hour.of.day +
agent.id +
entry.page +
path.id.set +
traffic.type +
session.duration +
day.of.week.num,
data = dat,
family = poisson(link = "log")
)
Second, let's proceed with the stepwise search and try to find the most efficient predictors.
step(full.model.poisson, direction = "both")
## Start: AIC=100012561
## hits ~ locale + hour.of.day + agent.id + entry.page + path.id.set +
## traffic.type + session.duration + day.of.week.num
##
## Df Deviance AIC
## <none> 95624062 100012561
## - day.of.week.num 1 95624226 100012723
## - hour.of.day 1 95624795 100013292
## - agent.id 1 95658954 100047451
## - locale 5 95695021 100083509
## - traffic.type 6 95810566 100199052
## - path.id.set 1 96343996 100732493
## - entry.page 6 96576973 100965459
## - session.duration 1 97670923 102059420
##
## Call: glm(formula = hits ~ locale + hour.of.day + agent.id + entry.page +
## path.id.set + traffic.type + session.duration + day.of.week.num,
## family = poisson(link = "log"), data = dat)
##
## Coefficients:
## (Intercept) localeL2 localeL3 localeL4
## 5.33669114 0.03576591 0.03406074 -0.02277977
## localeL5 localeL6 hour.of.day agent.id
## 0.02879158 0.02255320 -0.00029455 -0.00365396
## entry.page2111 entry.page2113 entry.page2114 entry.page2115
## 0.19329554 0.17038229 0.23047331 0.19451043
## entry.page2116 entry.page0 path.id.set traffic.type2
## 0.28444828 0.18955289 0.15919481 0.01477518
## traffic.type3 traffic.type4 traffic.type6 traffic.type7
## -0.06433595 -0.02247284 -0.06719006 -0.06611249
## traffic.type10 session.duration day.of.week.num
## -0.15655863 0.00003122 0.00045224
##
## Degrees of Freedom: 619234 Total (i.e. Null); 619212 Residual
## Null Deviance: 101200000
## Residual Deviance: 95620000 AIC: 100000000
This procedure suggests keeping all variables (as in full.model.poisson). Let's inspect the RMSE.
p_load(qpcR)
RMSE(full.model.poisson)
## [1] 12.4267
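Note that qpcR::RMSE is computed from the model's residuals as R reports them, which for a glm are presumably on the deviance scale rather than the response scale — hence the much smaller number. A response-scale RMSE, comparable with the OLS models above, could be computed directly from observed and fitted values (illustrative sketch):

```r
# Response-scale RMSE from observed values and fitted means:
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(c(10, 20, 30), c(12, 18, 33)) # toy example
# e.g. rmse(dat$hits, predict(full.model.poisson, type = "response"))
```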
It turns out this RMSE is the lowest one so far. However, remember that RMSEs are not directly comparable across linear models and GLMs (e.g., a GLM's prediction variance is not constant). Let's check for overdispersion anyway.
p_load(AER)
dispersiontest(full.model.poisson)
##
## Overdispersion test
##
## data: full.model.poisson
## z = 584.04, p-value < 0.00000000000000022
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 142.6257
There is evidence of overdispersion (estimated at 142.6257), violating the Poisson assumption of equidispersion. Let's now fit a Negative Binomial for completeness.
nbinom.full = glm.nb(formula.all, data = dat)
Let's now inspect the RMSE.
p_load(qpcR)
RMSE(nbinom.full)
## [1] 1.065472
It's better than the Poisson regression. Overall, the evidence from the GLMs is not conclusive: the data-generating process seems to be Poisson, but we do have overdispersion problems.
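A side note on equidispersion, the assumption the test above found violated: under a Poisson data-generating process, the mean equals the variance. A quick simulated sketch (purely illustrative):

```r
set.seed(1)
x <- rpois(1e5, lambda = 5) # simulated Poisson counts
mean(x) # close to 5
var(x)  # also close to 5: equidispersion
# Our hits variable, with an estimated dispersion near 143, is far from this.
```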
Let's take a quick look at the coefficients, standard errors, and R-squareds. The table summarizes all models estimated in this case study.
Let's now take a quick look at the plots.
p_load(ggplot2,dvmisc,qpcR)
all.vars.model.ols.plot = ggplot(dat, aes(
x = hits,
y = predict.lm(all.vars.model.ols, type="response"))) +
geom_smooth(method = "lm", colour = "red") +
geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
theme_bw() +
ggtitle("M1: All Variables, Full Dataset (OLS)") +
scale_y_continuous(name='Hits Predicted') +
scale_x_continuous(name='Hits Observed') +
labs(subtitle = paste("RMSE: ", round(sqrt(get_mse(all.vars.model.ols)), 3))) +
theme(axis.text.y = element_text(size=7),
axis.text.x = element_text(size=7),
axis.title.y = element_text(size=7),
axis.title.x = element_text(size=7),
legend.text=element_text(size=12),
legend.title=element_text(size=12),
plot.title = element_text(size=7),
legend.position="bottom")
all.vars.model.ols.cook.plot = ggplot(dat.cook, aes(
x = hits,
y = predict.lm(all.vars.model.ols.cook, type="response"))) +
geom_smooth(method = "lm", colour = "red") +
geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
theme_bw() +
ggtitle("M2: All Variables, Fixed CooksD Dataset (OLS)") +
scale_y_continuous(name='Hits Predicted') +
scale_x_continuous(name='Hits Observed') +
labs(subtitle = paste("RMSE: ", round(sqrt(get_mse(all.vars.model.ols.cook)), 3)))+
theme(axis.text.y = element_text(size=7),
axis.text.x = element_text(size=7),
axis.title.y = element_text(size=7),
axis.title.x = element_text(size=7),
legend.text=element_text(size=12),
legend.title=element_text(size=12),
plot.title = element_text(size=7),
legend.position="bottom")
all.vars.model.ols.cook.dffits.plot = ggplot(dat.cook.dffits, aes(
x = hits,
y = predict.lm(all.vars.model.ols.cook.dffits, type="response"))) +
geom_smooth(method = "lm", colour = "red") +
geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
theme_bw() +
ggtitle("M3: All Variables, Fixed CooksD and DFFITs Dataset (OLS)") +
scale_y_continuous(name='Hits Predicted') +
scale_x_continuous(name='Hits Observed') +
labs(subtitle = paste("RMSE: ", round(sqrt(get_mse(all.vars.model.ols.cook.dffits)), 3))) +
theme(axis.text.y = element_text(size=7),
axis.text.x = element_text(size=7),
axis.title.y = element_text(size=7),
axis.title.x = element_text(size=7),
legend.text=element_text(size=12),
legend.title=element_text(size=12),
plot.title = element_text(size=7),
legend.position="bottom")
best.vars.model.ols.plot = ggplot(dat.cook.dffits, aes(
x = hits,
y = predict.lm(best.vars.model.ols, type="response"))) +
geom_smooth(method = "lm", colour = "red") +
geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
theme_bw() +
ggtitle("M4: Best Variables, Fixed CooksD and DFFITs Dataset (OLS)") +
scale_y_continuous(name='Hits Predicted') +
scale_x_continuous(name='Hits Observed') +
labs(subtitle = paste("RMSE: ", round(sqrt(get_mse(best.vars.model.ols)), 3))) +
theme(axis.text.y = element_text(size=7),
axis.text.x = element_text(size=7),
axis.title.y = element_text(size=7),
axis.title.x = element_text(size=7),
legend.text=element_text(size=12),
legend.title=element_text(size=12),
plot.title = element_text(size=7),
legend.position="bottom")
rlm.best.vars.entire.df.plot = ggplot(dat, aes(
x = hits,
y = predict.lm(rlm.best.vars.entire.df, type="response"))) +
geom_smooth(method = "lm", colour = "red") +
geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
theme_bw() +
ggtitle("M5: All Variables, Full Dataset (Robust Regression)") +
scale_y_continuous(name='Hits Predicted') +
scale_x_continuous(name='Hits Observed') +
labs(subtitle = paste("RMSE: ", round(as.numeric(regr.eval(dat$hits, rlm.best.vars.entire.df$fitted.values)[3]), 3))) +
theme(axis.text.y = element_text(size=7),
axis.text.x = element_text(size=7),
axis.title.y = element_text(size=7),
axis.title.x = element_text(size=7),
legend.text=element_text(size=12),
legend.title=element_text(size=12),
plot.title = element_text(size=7),
legend.position="bottom")
rlm.best.vars.truncated.df.plot = ggplot(dat.cook.dffits, aes(
x = hits,
y = predict.lm(rlm.best.vars.truncated.df, type="response"))) +
geom_smooth(method = "lm", colour = "red") +
geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
theme_bw() +
ggtitle("M6: Best Variables, Fixed CooksD and DFFITs Dataset") +
scale_y_continuous(name='Hits Predicted') +
scale_x_continuous(name='Hits Observed') +
labs(subtitle = paste("RMSE: ", round(as.numeric(regr.eval(dat.cook.dffits$hits, rlm.best.vars.truncated.df$fitted.values)[3]), 3))) +
theme(axis.text.y = element_text(size=7),
axis.text.x = element_text(size=7),
axis.title.y = element_text(size=7),
axis.title.x = element_text(size=7),
legend.text=element_text(size=12),
legend.title=element_text(size=12),
plot.title = element_text(size=7),
legend.position="bottom")
full.model.poisson.plot = ggplot(dat, aes(
x = hits,
y = predict(full.model.poisson, type="response"))) +
geom_smooth(method = "lm", colour = "red") +
geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
theme_bw() +
ggtitle("M7: All Variables, Full Dataset (Poisson Regression)") +
scale_y_continuous(name='Hits Predicted') +
scale_x_continuous(name='Hits Observed') +
labs(subtitle = paste("RMSE: ", round(RMSE(full.model.poisson),3))) +
theme(axis.text.y = element_text(size=7),
axis.text.x = element_text(size=7),
axis.title.y = element_text(size=7),
axis.title.x = element_text(size=7),
legend.text=element_text(size=12),
legend.title=element_text(size=12),
plot.title = element_text(size=7),
legend.position="bottom")
nbinom.full.plot = ggplot(dat, aes(
x = hits,
y = predict(nbinom.full, type="response"))) +
geom_smooth(method = "lm", colour = "red") +
geom_jitter(alpha = .01, width = 80, height = 10, size=0.01, shape=23) +
theme_bw() +
ggtitle("M8: All Variables, Full Dataset (Negative Binomial Reg.)") +
scale_y_continuous(name='Hits Predicted') +
scale_x_continuous(name='Hits Observed') +
labs(subtitle = paste("RMSE: ", round(RMSE(nbinom.full),3))) +
theme(axis.text.y = element_text(size=7),
axis.text.x = element_text(size=7),
axis.title.y = element_text(size=7),
axis.title.x = element_text(size=7),
legend.text=element_text(size=12),
legend.title=element_text(size=12),
plot.title = element_text(size=7),
legend.position="bottom")
Now, let's arrange all the panels into one figure.
p_load(ggpubr)
ggarrange(all.vars.model.ols.plot,
all.vars.model.ols.cook.plot,
all.vars.model.ols.cook.dffits.plot,
best.vars.model.ols.plot,
rlm.best.vars.entire.df.plot,
rlm.best.vars.truncated.df.plot,
full.model.poisson.plot,
nbinom.full.plot,
ncol = 2, nrow = 4
)
According to all these analyses, Model 2 (linear, with high Cook's-distance observations removed) does the best job. It also has the best fit.
Now I will generate the first deliverable.
dat.deliverable$hits[dat.deliverable$hits == "\\N"] <- NA # Declare NAs
csv.dat = data.frame( # Generate the DF
row_num = dat.deliverable$row_num[is.na(dat.deliverable$hits)],
hits = dat.deliverable$hits[is.na(dat.deliverable$hits)]
)
write.csv(csv.dat,'deliverable1.csv') # Export DF to CSV
The second deliverable will be the .rmd file, and the third will be the PDF that the .rmd generates.